1  Introduction

This paper explores one dimension along which word spotting and
speech recognition differ: the nature of the background model. In
word spotting, a relatively small number of keywords float on a sea
of unknown words. In speech recognition, an occasional unknown word
punctuates utterances that are otherwise completely in-vocabulary.
Despite this difference in viewpoint, in some circumstances
implementations of the two may become very similar. When transcribed
data is available for a domain, word spotting benefits from the more
detailed background model this can support [9]. The manner in which
the background is modeled in these cases is reminiscent of speech
recognition. For example, a large vocabulary with good coverage may
be extracted from the corpus, so that relatively few words in an
utterance remain unmodeled. In this case, the situation is
qualitatively similar to out-of-vocabulary (OOV) modeling in a
conventional speech recognizer, except that the vocabulary is
strictly divided into “filler” and “keyword”.

This paper describes a mechanism for bootstrapping from a relatively
weak background model for word spotting, where OOV words dominate,
to a much stronger model where many more word or phrase clusters
have been “moved to the foreground” and explicitly modeled. With
this increase in vocabulary comes an increase in the potency of
language modeling, boosting performance on the original vocabulary.

The following sections show how a conventional speech recognizer can
be convinced to cluster frequently occurring acoustic patterns,
without requiring the existence of transcribed data.

2  Boot-strapping  the  lexicon 

A recognizer with a phone-based OOV model is able to recover an
approximate phonetic representation for words or word sequences that
are not in its vocabulary. If commonly occurring phone sequences can
be located, then adding them to the vocabulary will allow the
language model to capture their co-occurrence with words in the
original vocabulary, potentially boosting recognition performance.
This suggests building a “clustering engine” that scans the output
of the speech recognizer, correlates OOV phonetic sequences across
all the utterances, and updates the vocabulary with any frequent,
robust phone sequences it finds. While this is feasible, the kinds
of judgments the clustering engine needs to make about acoustic
similarity and alignment are exactly those at which the speech
recognizer is most adept. This section describes a way to convince
the speech recognizer to perform clustering almost for free,
eliminating the need for an external module to make acoustic
judgments.

Distribution Statement A: Approved for Public Release, Distribution
Unlimited

Figure 1: The iterative clustering procedure.

The clustering procedure is shown in Figure 1. An n-gram-based
language model is initialized randomly, or trained up using whatever
data is available, for example a small collection of transcribed
utterances. Unrecognized words are explicitly represented using a
phone-based OOV model, described in the next section. The recognizer
is then run on a large set of untranscribed data. The phonetic and
word level outputs of the recognizer are compared so that
occurrences of OOV words are assigned a phonetic transcription. A
randomly cropped subset of these is tentatively entered into the
vocabulary, without any attempt yet to evaluate their significance
(e.g. whether they occur frequently, whether they are dangerously
similar to a keyword, etc.). The hypotheses made by the recognizer
are used to retrain the language model, making sure to give the
newly added vocabulary items some probability in the model. Then the
recognizer runs using the new language model and the process
iterates. The recognizer’s output can be used to evaluate the worth
of the new “vocabulary” entries. The following sections detail how
to eliminate vocabulary items the recognizer finds little use for,
and how to detect and resolve competition between similar items.
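The control flow of this loop can be sketched as follows. This is only an illustrative skeleton, not the code used in the experiments: the recognizer, the proposal of new items, the pruning step and the convergence test are passed in as functions, and all names (run_clustering, recognize, propose_items, prune_items, has_converged) are hypothetical.

```python
from typing import Callable, List, Set, Tuple

Hypotheses = List[List[str]]  # one list of vocabulary items per utterance


def run_clustering(
    recognize: Callable[[Set[str], Hypotheses], Hypotheses],
    propose_items: Callable[[Hypotheses], Set[str]],
    prune_items: Callable[[Set[str], Hypotheses], Set[str]],
    has_converged: Callable[[Hypotheses, Hypotheses], bool],
    keywords: Set[str],
    max_iterations: int = 20,
) -> Tuple[Set[str], Hypotheses]:
    """Iterate recognition, vocabulary update and language model retraining."""
    vocabulary = set(keywords)
    lm_training_text: Hypotheses = []  # empty text stands in for the initial model
    previous: Hypotheses = []
    hypotheses: Hypotheses = []
    for iteration in range(max_iterations):
        # Run the recognizer with the current vocabulary and language model data.
        hypotheses = recognize(vocabulary, lm_training_text)
        if iteration > 0 and has_converged(previous, hypotheses):
            break
        # Drop items the recognizer found little use for; keywords always stay.
        vocabulary = prune_items(vocabulary, hypotheses) | set(keywords)
        # Tentatively add new phoneme sequences recovered from OOV regions.
        vocabulary |= propose_items(hypotheses)
        # The hypotheses themselves become the retraining text.
        lm_training_text = hypotheses
        previous = hypotheses
    return vocabulary, hypotheses
```

Keeping the keywords outside the reach of the pruning step mirrors the special status they are given in Section 2.4.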

2.1  Extracting  OOV  phone  sequences 

The recognizer used the OOV model described in [1], contributed by
Issam. This model can match an arbitrary sequence of phones, and has
a phone bigram to capture phonotactic constraints. The OOV model is
placed in parallel with the models for the words in the vocabulary.
A cost parameter can control how much the OOV model is used at the
expense of the in-vocabulary models. This value was fixed at zero
throughout the experiments described in this paper, since it was
more convenient to control usage at the level of the language model.
The bigram used in this project is exactly the one used in [1], with
no training for the particular domain.
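One way to read off the phone sequence for each OOV occurrence is to compare time-stamped word-level and phone-level outputs. The sketch below assumes both outputs are available as (label, start, end) triples, which is an assumption about the recognizer's output format rather than something specified here; the function name oov_phone_sequences is hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[str, float, float]  # (label, start_time, end_time) in seconds


def oov_phone_sequences(
    words: List[Segment],
    phones: List[Segment],
    oov_label: str = "<oov>",
) -> List[Tuple[str, ...]]:
    """Collect the phones whose midpoints fall inside each OOV word segment."""
    sequences = []
    for label, start, end in words:
        if label != oov_label:
            continue
        inside = tuple(
            phone
            for phone, p_start, p_end in phones
            if start <= (p_start + p_end) / 2 < end
        )
        if inside:
            sequences.append(inside)
    return sequences
```

Assigning phones by midpoint avoids double-counting a phone that straddles a word boundary.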

2.2  Recovering  phonemic  representations 

It is useful to convert the extracted phone sequences to phonemes if
they are to be added as baseforms in the lexicon. Although the
sequences could be kept in their original form by creating a dummy
set of units for the baseforms that are passed verbatim by the
phonological rules, converting to phonemes adds some small amount of
generalization over allophones to the sequence’s pronunciation, and
reduces the number of competing forms that have to be dealt with
later (see Section 2.4). I make the conversion in a naive way,
classifying single or paired phonetic units into a set of
equivalence classes that correspond to phonemes. For example, taps
and cleanly enunciated stops are mapped to the same phoneme, with
explicit closures being dropped. Although the procedure does not
capture some contextual effects, it achieves perfectly adequate
performance (see Section 3).
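A minimal sketch of such a naive mapping is given below. The phone labels (TIMIT-style closures such as "tcl" and the tap "dx") and the choice of mapping the tap to /t/ are assumptions made for illustration; the actual equivalence classes used in the experiments are not listed in this paper.

```python
from typing import Dict, List, Tuple

# Hypothetical equivalence classes, for illustration only.
PAIR_CLASSES: Dict[Tuple[str, str], str] = {
    ("tcl", "t"): "t",  # closure followed by release: a cleanly enunciated stop
    ("dcl", "d"): "d",
    ("kcl", "k"): "k",
}
SINGLE_CLASSES: Dict[str, str] = {"dx": "t"}  # tap mapped to the same phoneme
CLOSURES = {"pcl", "tcl", "kcl", "bcl", "dcl", "gcl"}


def to_phonemes(phones: List[str]) -> List[str]:
    """Map single or paired phonetic units to phoneme classes."""
    phonemes = []
    i = 0
    while i < len(phones):
        pair = tuple(phones[i:i + 2])
        if pair in PAIR_CLASSES:
            phonemes.append(PAIR_CLASSES[pair])
            i += 2
        elif phones[i] in CLOSURES:
            i += 1  # an explicit closure on its own is dropped
        else:
            phonemes.append(SINGLE_CLASSES.get(phones[i], phones[i]))
            i += 1
    return phonemes
```

With this mapping a tapped and a fully released rendering of the same sequence collapse to one phoneme string, which is the reduction in competing forms described above.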


Phoneme sequences are given an arbitrary name and added to the list
of vocabulary and baseforms. To ensure that the language model
assigns some probability to these new vocabulary items the next time
the recognizer runs, a collection of randomly generated sentences is
added to the recognizer output used in retraining.
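The paper does not say how the random sentences are generated, so the following is only one plausible sketch: each new item is planted at a random position in a few sentences of random length drawn uniformly from the existing vocabulary. The function name and the parameters per_item, max_length and seed are hypothetical.

```python
import random
from typing import Iterable, List


def random_sentences(
    vocabulary: Iterable[str],
    new_items: Iterable[str],
    per_item: int = 5,
    max_length: int = 6,
    seed: int = 0,
) -> List[List[str]]:
    """Generate sentences that each contain one newly added vocabulary item."""
    rng = random.Random(seed)
    words = sorted(vocabulary)
    sentences = []
    for item in sorted(new_items):
        for _ in range(per_item):
            length = rng.randint(1, max_length)
            sentence = [rng.choice(words) for _ in range(length)]
            # Overwrite one random position so the new item is guaranteed to occur.
            sentence[rng.randrange(length)] = item
            sentences.append(sentence)
    return sentences
```

Appending these sentences to the recognizer's hypotheses before retraining gives every new item a nonzero count in a variety of contexts.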

2.3  Dealing  with  rarely-used  additions 

If a phoneme sequence introduced into the vocabulary is actually a
common sound sequence in the acoustic data, then the recognizer will
pick it up and use it. Otherwise, it will simply not appear very
often in hypotheses. After each iteration a histogram of phoneme
sequence occurrences in the output of the recognizer is generated,
and those below a threshold are cut.
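The pruning step amounts to counting and thresholding. A minimal sketch, with a hypothetical function name and threshold parameter:

```python
from collections import Counter
from typing import Iterable, List, Set


def prune_rare_items(
    hypotheses: List[List[str]],
    added_items: Iterable[str],
    min_count: int,
) -> Set[str]:
    """Keep only the added items that occur at least min_count times."""
    counts = Counter(token for hypothesis in hypotheses for token in hypothesis)
    return {item for item in added_items if counts[item] >= min_count}
```

Only items added by clustering are passed in, so keywords are never candidates for removal.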

2.4  Dealing  with  competing  additions 

Very often, two or more very similar phoneme sequences will be added
to the vocabulary. If the sounds they represent are in fact commonly
occurring, both are likely to prosper and be used more or less
interchangeably by the recognizer. This is unfortunate for language
modeling purposes, since their statistics will not be pooled and so
will be less robust. Happily, the output of the recognizer makes
such situations very easy to detect. In particular, this kind of
confusion can be uncovered through analysis of the N-best utterance
hypotheses.

If we imagine a set of N-best hypotheses aligned and stacked
vertically, then competition is indicated if two vocabulary items
exhibit both of these properties:

■  Horizontally repulsive - if one of the items appears in a single
hypothesis, the other will not appear in its vicinity.

■  Vertically attractive - the items frequently occur in the same
part of a collection of hypotheses for a particular utterance.

Since  the  utterances  in  this  domain  are  generally  short 
and  simple,  it  did  not  prove  necessary  to  rigorously 
align  the  hypotheses.  Instead,  items  were  considered 
to  be  aligned  based  simply  on  the  vocabulary  items 
preceding  and  succeeding  them.  It  is  important  to 
measure  both  the  attractive  and  repulsive  conditions  to 
distinguish  competition  from  vocabulary  items  that  are 
simply  likely  or  unlikely  to  occur  in  close  proximity. 

Accumulating statistics about the above two properties across all
utterances gives a reliable measure of whether two vocabulary items
are essentially acoustically equivalent to the recognizer. If they
are, they can be merged or pruned so that the statistics maintained
by the language model will be well trained. For clear-cut cases, the
competing items are merged as alternatives in the baseform entry for
a single vocabulary unit. A better alternative might have been to
use class n-grams and put the items into the same class, but this
works fine. For less clear-cut cases, one item is simply deleted.

Here is an example of this process in operation in the very first
iteration of the algorithm after new vocabulary items have been
added. These are the 10-best hypotheses for the given utterance:

“what is the phone number for victor zue”

<oov> phone (n ah m b er) (m ih t er z) (y uw)
<oov> phone (n ah m b er) (m ih t er z) (z y uw)
<oov> phone (n ah m b er) (m ih t er z) (uw)
<oov> phone (n ah m b er) (m ih t er z) (z uw)
<oov> phone (ah m b er f) (m ih t er z) (z y uw)
<oov> phone (ah m b er f) (m ih t er z) (y uw)
<oov> (ax f aa n ah) (m b er f axr) (m ih t er z) (z y uw)
<oov> (ax f aa n ah) (m b er f axr) (m ih t er z) (y uw)
<oov> phone (ah m b er f) (m ih t er z) (z uw)
<oov> phone (ah m b er f) (m ih t er z) (uw)

The “<oov>” symbol corresponds to an out-of-vocabulary sequence. The
phone sequences within parentheses are uses of items added to the
vocabulary in the last iteration. From this single utterance, we
acquire evidence that:

■  The entry for (ax f aa n ah) may be competing with the keyword
“phone”. If this holds up statistically across all the utterances,
the entry will be destroyed. Keywords are given special status,
since they represent a link to the outside world that should not be
modified.

■  (n ah m b er), (m b er f axr) and (ah m b er f) may be competing.
They are compared against each other because all of them are
followed by the same sequence (m ih t er z) and many of them are
preceded by the same word “phone”.

■  (y uw), (z y uw), and (uw) may be competing.

All  of  these  will  be  patched  up  for  the  next  iteration. 
Section  3  shows  stable  baseforms  created  through  this 
process. 
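The two tests can be sketched in a few lines. The version below is a simplification under stated assumptions, not the implementation used here: the "vicinity" of an item is taken to be the whole hypothesis, two items count as aligned when they share a predecessor or a successor in different hypotheses for the same utterance, and the function names and the min_aligned threshold are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

NBest = List[List[str]]  # the N-best hypotheses for one utterance


def competition_counts(nbest_lists: List[NBest]) -> Tuple[Counter, Counter]:
    """Count, per item pair, utterances where the pair co-occurs or aligns."""
    together: Counter = Counter()  # both items inside a single hypothesis
    aligned: Counter = Counter()   # items share a neighbour across hypotheses
    for nbest in nbest_lists:
        pairs_together: Set[FrozenSet[str]] = set()
        after: Dict[str, Set[str]] = {}   # predecessor -> items that follow it
        before: Dict[str, Set[str]] = {}  # successor -> items that precede it
        for hypothesis in nbest:
            for a, b in combinations(set(hypothesis), 2):
                pairs_together.add(frozenset((a, b)))
            padded = ["<s>"] + list(hypothesis) + ["</s>"]
            for prev, item, nxt in zip(padded, padded[1:], padded[2:]):
                after.setdefault(prev, set()).add(item)
                before.setdefault(nxt, set()).add(item)
        pairs_aligned: Set[FrozenSet[str]] = set()
        for group in list(after.values()) + list(before.values()):
            for a, b in combinations(sorted(group), 2):
                pairs_aligned.add(frozenset((a, b)))
        together.update(pairs_together)
        aligned.update(pairs_aligned)
    return together, aligned


def competing_pairs(nbest_lists: List[NBest], min_aligned: int = 1) -> Set[FrozenSet[str]]:
    """Pairs that are vertically attractive and horizontally repulsive."""
    together, aligned = competition_counts(nbest_lists)
    return {
        pair
        for pair, count in aligned.items()
        if count >= min_aligned and together[pair] == 0
    }
```

Run on hypotheses like those in the example above, this flags (n ah m b er), (ah m b er f) and (m b er f axr) as mutual competitors, pairs the (y uw) variants, and pairs (ax f aa n ah) with the keyword “phone”. In practice min_aligned would be set well above 1 so that a decision rests on statistics across many utterances.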

This use of the N-best utterance hypotheses is reminiscent of their
application to computing a measure of recognition confidence in [3].


2.5  Testing  for  convergence 

For any iterative procedure, it is important to know when to stop.
If we have transcribed data, we can track the keyword error rate on
that data and halt when the increment in performance is sufficiently
small.

If there is no transcribed data, then we cannot directly measure the
error rate. We can however bound the rate at which it is changing by
comparing keyword locations in the output of the recognizer between
iterations. If few keywords are shifting location, then the error
rate cannot be changing by more than a certain bound. We can
therefore place a convergence criterion on this bound rather than on
the actual keyword error rate. It is important to measure only
changes in keyword locations, and not changes in vocabulary items
added by clustering. Items that do not occur often tend to be
destroyed and rediscovered continuously, making comparisons
difficult.
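A rough sketch of such a criterion follows. It assumes each keyword detection is summarized as (utterance id, keyword, midpoint time), and treats a detection as unchanged if the other iteration has the same keyword in the same utterance within a small time tolerance; the representation, the function name and the tolerance value are all assumptions made for illustration.

```python
from typing import List, Tuple

Detection = Tuple[int, str, float]  # (utterance_id, keyword, midpoint in seconds)


def shifted_fraction(
    previous: List[Detection],
    current: List[Detection],
    tolerance: float = 0.1,
) -> float:
    """Fraction of keyword detections with no counterpart in the other iteration."""

    def unmatched(source: List[Detection], target: List[Detection]) -> int:
        return sum(
            1
            for utt, keyword, mid in source
            if not any(
                u == utt and k == keyword and abs(m - mid) <= tolerance
                for u, k, m in target
            )
        )

    total = len(previous) + len(current)
    if total == 0:
        return 0.0
    return (unmatched(previous, current) + unmatched(current, previous)) / total
```

Iteration would stop once this fraction falls below a chosen threshold. Only keyword detections are passed in, so the churn among clustered items does not affect the test.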


3  Qualitative  Results 

This  section  describes,  through  examples,  the  kinds  of 
vocabulary  discovered  by  the  clustering  procedure. 
Numerical,  performance-related  results  are  reported  in 
Section  4. 

Results given here are from a clustering session with an initial
vocabulary of five keywords (email, phone, room, office, address),
run on the training data, and not using the transcripts for that
data at all.


Here are the top 10 clusters discovered on this very typical run,
ranked by decreasing frequency of occurrence:

 1  n ah m b er
 2  w eh r ih z
 3  w ah t ih z
 4  t eh l m iy
 5  k ix n y uw
 6  p l iy z
 7  ae ng k y uw
 8  n ow
 9  hh aw ax b aw
10  g r uw p


These clusters are used consistently by the recognizer in places
corresponding to: “number, where_is, what_is, tell_me, can_you,
please, thank_you, no, how_about, group,” respectively. The first,
/n ah m b er/, is very frequent because of “phone number”, “room
number”, and “office number”. Once it appears as a cluster the
language model is immediately able to improve recognition
performance on those keywords.




The  word  groups  picked  out  are  actually  rather  like  the 
merged  words  often  placed  in  a  conventional  lexicon  - 
“where_is”,  “what_is”  etc. 

Other high-frequency clusters correspond to common first names
(Karen, Michael). Victor Zue, /ih t er z uw/, and Jim Glass,
/jh ih n b ae s/, get clusters all to themselves. Note the loss of
the initial fricative in Victor - this is typical (see also the
rendering of thank_you as /ae ng k y uw/). This may be partially due
to the characteristics of speech over a phone line, where much of
the high frequency component is lost. The remaining clusters are
less likely to correspond to anything meaningful and have little
effect on recognition performance. Parts of people’s names are
common.

Curiously, the cluster corresponding to yes, /y eh s/, consistently
takes longer to appear and is lower in frequency than no, /n ow/,
which is very frequent. Possibly people were saying “no!” to the
early phone-in recognizer much more than they were saying “yes!”

Every now and then a “parasite” appears such as /dh ax f ow n/ (from
an instance of “the phone” that the recognizer fails to spot) or
/iy n eh l/ (from “email”). These have the potential to interfere
with the detection of the keywords they resemble acoustically. But
as soon as they have any success, they are detected and eliminated
as described in Section 2.4. It is possible that a parasite that
doesn’t get greedy, and for example limits itself to one person’s
pronunciation of a keyword, will not be detected, although I didn’t
see any examples of this happening.

Many simple sentences can be modeled completely after clustering,
without need to fall back on the generic OOV phone model. For
example, the utterances:

What is Victor Zue’s room number

Please connect me to Leigh Deacon

are recognized as:

(w ah t ih z) (ih t er z uw) room (n ah m b er)

(p l iy z) (k ix n eh k) (m iy t uw) (l iy d iy) (k ix n)

All of these items are entries in the vocabulary and so contribute
to the language model. All the discovered vocabulary items are
assigned one or more baseforms as described in Section 2.4. These
baseforms often cover trivial variations in a feature of one or two
phones. For example, following the format of the baseforms file we
have:

t_eh_l_m_iy:  (t eh l m iy, d eh l m iy)

p_l_iy_z:  (p l iy z, p l iy s)

w_er_k:  (w er k, w ao r k)

Other baseforms contain more variation:

n_ah_m_b_er:  (n ah m b er, ah m b er, en ah m b er)

w_ah_t_ih_z:  (w ah t ih z, w ah d ih z, w ah t s, w ah t s t,
               w ah t er, w ah s dh ax)

The nasal in /n ah m b er/ is sometimes recognized, sometimes not,
so both pronunciations are added to a single baseform. Short, often
unstressed words such as the definite and indefinite articles are
not clustered by the algorithm. Their influence instead appears in
baseforms, for example the /w ah s dh ax/ entry above.

4  Quantitative  Results 

For experiments involving small vocabularies, it is appropriate to
measure performance in terms of Keyword Error Rate (KER). I take
this to be:

$$\mathrm{KER} = \left( \frac{F + M}{T} \right) \times 100\,\%$$

with:

F :  Number of false or poorly localized detections
M :  Number of missed detections
T :  True number of keyword occurrences in data

A detection is only counted as such if it occurs at the right time.
Specifically, the midpoint of the hypothesized time interval must
lie within the true time interval the keyword occupies. I take
forced alignments of the test set as ground truth. This means that
for testing it is better to omit utterances with artifacts and words
outside the full vocabulary, so that the forced alignment is likely
to be sufficiently precise.
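The KER definition and the midpoint criterion can be sketched as follows. Two choices here are my assumptions rather than part of the definition above: each true occurrence can be matched by at most one detection, and a poorly localized detection counts in F while the occurrence it failed to cover still counts in M. The function name and tuple layout are hypothetical.

```python
from typing import List, Set, Tuple

Interval = Tuple[int, str, float, float]  # (utterance_id, keyword, start, end)


def keyword_error_rate(detections: List[Interval], truth: List[Interval]) -> float:
    """KER = 100 * (F + M) / T, using the midpoint-in-interval criterion."""
    if not truth:
        raise ValueError("KER is undefined without true keyword occurrences")
    matched: Set[int] = set()
    false_detections = 0  # F: false or poorly localized detections
    for utt, keyword, start, end in detections:
        midpoint = (start + end) / 2
        hit = next(
            (
                i
                for i, (t_utt, t_kw, t_start, t_end) in enumerate(truth)
                if i not in matched
                and t_utt == utt
                and t_kw == keyword
                and t_start <= midpoint <= t_end
            ),
            None,
        )
        if hit is None:
            false_detections += 1
        else:
            matched.add(hit)
    missed = len(truth) - len(matched)  # M: true occurrences never detected
    return 100.0 * (false_detections + missed) / len(truth)
```

Because F is not bounded by T, the KER computed this way can exceed 100% when the recognizer produces many false detections.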

The experiments here are designed to identify when clustering leads
to reduced error rates on a keyword vocabulary. Since the form of
clustering addressed in this paper is fundamentally about extending
the vocabulary, we would expect it to be useless if the vocabulary
is already large enough to give good coverage. We would expect it to
offer the greatest improvement when the vocabulary is smallest. To
measure the effect of coverage, the full vocabulary was made smaller
and smaller by incrementally removing the most infrequent words. A
set of keywords was chosen and kept constant and in the vocabulary
across all the experiments so the results would not be confounded by
properties of the keywords themselves (for example, the most common
word “the” would make a very bad keyword since it is often
unstressed and loosely pronounced). The same set of keywords was
used as in Section 3.
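The vocabulary reduction can be sketched as below, with coverage taken to be the percentage of word tokens in the transcripts that fall inside the vocabulary. That definition of coverage, the function name and its arguments are assumptions made for illustration; transcripts are used here only to set up and describe the experiment, not by the clustering itself.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def coverage_curve(
    corpus: List[List[str]],
    keywords: Set[str],
    sizes: Iterable[int],
) -> Dict[int, float]:
    """Token coverage (%) when only the most frequent words plus keywords are kept."""
    counts = Counter(token for sentence in corpus for token in sentence)
    total = sum(counts.values())
    # Non-keyword words ranked from most to least frequent.
    ranked = [word for word, _ in counts.most_common() if word not in keywords]
    curve = {}
    for size in sizes:
        extra = max(0, size - len(keywords))
        vocabulary = set(keywords) | set(ranked[:extra])
        covered = sum(c for word, c in counts.items() if word in vocabulary)
        curve[size] = 100.0 * covered / total
    return curve
```

Shrinking the size argument removes the least frequent words first while the keywords always remain in the vocabulary.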

Clustering is again performed without making any use of transcripts.
To truly eliminate any dependence on the transcripts, an acoustic
model trained only on Pegasus data was used. This reduced
performance but made it easier to interpret the results.

Figure 2 shows a plot of error rates on the test data as the size of
the vocabulary is varied to provide different degrees of coverage.
The most striking result is that the clustering mechanism reduces
the sensitivity of performance to drops in coverage. In this
scenario, the error rate achieved with the full vocabulary (which
gives 84.5% coverage on the training data) is 33.3%. When the
coverage is low, the clustered solution error rate remains under
50%; in relative terms, the error increases by at most half of its
best value. Straight application of a language model, without
clustering, gives error rates more than double or treble that best
value.
Figure 2: Keyword error rate of baseline recognizer and clustering
recognizer as total coverage varies.

As a reference point, a language model trained with the full
vocabulary on the full set of transcriptions, combined with an
acoustic model trained on all available data, gives an 8.3% KER.

5  Conclusions 

Speech  recognizers  are  painstakingly  engineered  to 
factor  all  available  data  into  making  judgments  of 
acoustic  similarity.  This  makes  them  a  natural  tool  for 
clustering  acoustic  data  that  is  hard  to  beat  (and  I 
tried).  A  recognizer  that  generates  N-best  hypotheses 
is  particularly  suited  to  this  task. 

The  clustering  mechanism  described  in  this  paper  can 
build  a  language  model  based  on  untranscribed  data. 


In the interval between the start of acoustic data collection and
the point at which enough data has been transcribed to provide
reasonable coverage, clustering has the potential to boost
performance. This might be useful in off-the-shelf systems designed
for non-experts, so that the user sees a quicker return on their
efforts.

An important issue not touched on at all here is whether it is
possible to train an acoustic model from untranscribed data. This
seems a much harder problem. But in the low-coverage regime
clustering is aimed at, the language model is likely to be the
limiting factor to performance.